Football is a very exciting sport, and it remains the most popular game on the planet. Sorry, not sorry about other games.
I want to review data collected since 1872 to understand how matches between countries have evolved up to the present. So we call on R and a few libraries to help us visualize the data:
library(tidyverse)
library(plotly)
library(lubridate)
The first step is to read the file. I downloaded this dataset from Kaggle on 2021-07-22.
results <- read.csv("results.csv", encoding = "UTF-8")
This dataset covers \(42k+\) football matches in the history of international encounters between national teams. Let’s take a first look at the data:
head(results)
It is also interesting to look at the context of the matches. Some of them may not be relevant at all, but there are also World Cup matches, continental tournaments, and so on:
levels(as.factor(results$tournament)) -> tournaments
sample(tournaments,20)
## [1] "Simba Tournament"
## [2] "Island Games"
## [3] "Pan American Championship"
## [4] "Atlantic Heritage Cup"
## [5] "United Arab Emirates Friendship Tournament"
## [6] "CFU Caribbean Cup qualification"
## [7] "Inter Games Football Tournament"
## [8] "CONCACAF Nations League"
## [9] "Intercontinental Cup"
## [10] "UEFA Euro"
## [11] "CONCACAF Championship qualification"
## [12] "Dunhill Cup"
## [13] "Nile Basin Tournament"
## [14] "Brazil Independence Cup"
## [15] "Copa América"
## [16] "Copa Félix Bogado"
## [17] "EAFF Championship"
## [18] "Copa Roca"
## [19] "Atlantic Cup"
## [20] "COSAFA Cup"
Keeping only tournaments with more than 100 matches played in their history:
results %>%
  group_by(tournament) %>%
  summarise(count = n()) %>%
  filter(count > 100) %>%
  select(tournament) -> popularCups
results %>%
  filter(tournament %in% popularCups$tournament) %>%
  ggplot(aes(x = tournament, fill = tournament)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Matches in tournaments") -> p
ggplotly(p)
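The count-and-filter step can be tried on a toy data frame before running it on the full dataset. This is only a sketch: the tournament names are made up, the threshold is 2 instead of 100, and `toy`, `min_matches` and `popular` are illustrative names that do not appear in the analysis.

```r
library(dplyr)

# Toy data: made-up tournament labels, one row per match (illustrative only)
toy <- data.frame(
  tournament = c("Cup A", "Cup A", "Cup A", "Cup B", "Cup C", "Cup C"),
  stringsAsFactors = FALSE
)

min_matches <- 2  # hypothetical threshold; the post uses 100

# count() is shorthand for group_by() + summarise(n())
popular <- toy %>%
  count(tournament, name = "matches") %>%
  filter(matches > min_matches)

print(popular)
```

Only "Cup A" survives here, because the comparison is strict: a tournament with exactly `min_matches` matches is dropped.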
Now we need to process the data a little to assign points in a standard way, based on the outcome of every match:
| Points | Outcome |
|---|---|
| \(3\) | Victory |
| \(1\) | Tie |
| \(0\) | Defeat |
In FIFA scoring, 2 points can be earned by winning a shootout after a tied match; however, I ignored that in the following analysis.
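As a quick check of the 3/1/0 scheme, here is a minimal sketch on made-up scores. `match_points` is a hypothetical helper introduced only for illustration; it is not a function used elsewhere in this post.

```r
# Points for one side of a match under the 3/1/0 scheme.
# `match_points` is a hypothetical helper, not part of the original analysis.
match_points <- function(goals_for, goals_against) {
  ifelse(goals_for > goals_against, 3,
         ifelse(goals_for == goals_against, 1, 0))
}

# Made-up scores for three matches: home win, tie, away win
home_score <- c(2, 1, 0)
away_score <- c(0, 1, 3)

home_points <- match_points(home_score, away_score)
away_points <- match_points(away_score, home_score)

print(home_points)
print(away_points)
```

Calling the same helper with the arguments swapped gives the away side's points, so the two columns cannot drift out of sync.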
Let’s see how the data looks now:
results %>%
  mutate(tied = home_score == away_score) %>%
  mutate(home_points = ifelse(tied, 1, ifelse(home_score > away_score, 3, 0))) %>%
  mutate(away_points = ifelse(tied, 1, ifelse(home_score > away_score, 0, 3))) -> results
results %>%
  filter(grepl("FIFA World Cup", tournament)) -> worldCupResults
head(worldCupResults)
After this step we also need to reshape the dataset into long format, with one row per team per match, so that we can measure the performance of each national team. We then keep the tournaments that contain "FIFA World Cup" in their name.
results %>%
  pivot_longer(c(home_team, away_team), names_to = "homeaway", values_to = "team") %>%
  mutate(points = ifelse(grepl("home", homeaway), home_points, away_points),
         goals = ifelse(grepl("home", homeaway), home_score, away_score),
         receivedGoals = ifelse(grepl("home", homeaway), away_score, home_score)) %>%
  select(date, tournament, country, team, points, goals, receivedGoals) -> results
results %>%
  filter(grepl("FIFA World Cup", tournament)) -> worldCupResults
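To make the reshaping step concrete, the sketch below applies the same `pivot_longer()` idea to a single made-up match. The team names, the date, and the `is_home` helper column are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# One made-up match with points already assigned (illustrative values)
toy <- tibble(
  date = "2000-01-01",
  home_team = "Team A", away_team = "Team B",
  home_score = 2, away_score = 1,
  home_points = 3, away_points = 0
)

# Each match becomes two rows: one for the home side, one for the away side
long <- toy %>%
  pivot_longer(c(home_team, away_team),
               names_to = "homeaway", values_to = "team") %>%
  mutate(
    is_home       = homeaway == "home_team",
    points        = ifelse(is_home, home_points, away_points),
    goals         = ifelse(is_home, home_score, away_score),
    receivedGoals = ifelse(is_home, away_score, home_score)
  ) %>%
  select(date, team, points, goals, receivedGoals)

print(long)
```

After the pivot, every statistic can be summed per team without caring whether the team played at home or away.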
The most interesting matches take place at the FIFA World Cup, so we can focus on what happens in this tournament:
# All World Cup matches are used here, qualifiers included;
# the qualifiers are only filtered out in the next step.
worldCupResults %>%
  mutate(yr = year(date)) %>%
  group_by(yr, team) %>%
  summarise(p = sum(points), goals = sum(goals),
            against = sum(receivedGoals), matches = n()) %>%
  mutate(performance = p / matches,
         ofensive = goals / matches,
         defense = against / matches) %>%
  ggplot(aes(x = yr, y = performance, fill = team)) +
  geom_bar(stat = "identity") -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)
Germany emerges as the best performer across all matches related to the World Cup. That is no surprise at all: recall all the “goleadas” (thrashings) it has produced, in the qualifiers as well as in the knockout matches in the final stages of the tournament.
Now let’s see what happens if we focus only on the final stage, that is, filtering out the qualifiers:
worldCupResults %>%
  filter(!grepl("qualifi", tournament)) %>%
  mutate(yr = year(date)) %>%
  group_by(yr, team) %>%
  summarise(p = sum(points), goals = sum(goals),
            against = sum(receivedGoals), matches = n()) %>%
  mutate(performance = p / matches,
         ofensive = goals / matches,
         defense = against / matches) %>%
  filter(team %in% c("Mexico", "Brazil", "Argentina", "Germany", "France")) %>%
  ggplot(aes(x = yr, y = performance, color = team)) +
  geom_line() -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)
# Goal difference per match: goals scored per match minus goals conceded per match
worldCupResults %>%
  filter(!grepl("qualifi", tournament)) %>%
  mutate(yr = year(date)) %>%
  group_by(yr, team) %>%
  summarise(p = sum(points), goals = sum(goals),
            against = sum(receivedGoals), matches = n()) %>%
  mutate(performance = p / matches,
         ofensive = goals / matches,
         defense = against / matches) %>%
  mutate(differenceGoal = ofensive - defense) %>%
  ggplot(aes(x = yr, color = team, y = differenceGoal)) +
  geom_line() -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)
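The two per-match metrics plotted above can be checked by hand on a tiny example. The sketch below uses made-up rows in the long format (one row per team per match). `toy` and `per_team` are illustrative names, and `.groups = "drop"` is added only to silence the `summarise()` message.

```r
library(dplyr)

# Made-up long-format rows: one row per team per match (illustrative values)
toy <- tibble(
  yr            = c(2002, 2002, 2002, 2002),
  team          = c("Team A", "Team A", "Team B", "Team B"),
  points        = c(3, 1, 0, 1),
  goals         = c(2, 1, 0, 1),
  receivedGoals = c(0, 1, 2, 1)
)

per_team <- toy %>%
  group_by(yr, team) %>%
  summarise(p = sum(points), goals = sum(goals),
            against = sum(receivedGoals), matches = n(),
            .groups = "drop") %>%
  mutate(
    performance    = p / matches,                  # points per match
    differenceGoal = (goals - against) / matches   # goal difference per match
  )

print(per_team)
```

Here `(goals - against) / matches` is algebraically the same as subtracting the per-match defensive average from the per-match offensive average, which is how `differenceGoal` is computed in the pipeline above.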